May 20th, 2021

Agenda

  1. Introduction
  2. Data gathering & cleaning
  3. Descriptive statistics
  4. Linear regressions (CPU & GPU)
  5. XGBoost
  6. Points of further consideration

Introduction

The 2020 US presidential election saw the highest voter turnout in history, driven by a clash of socio-economic groups and ideologies:

Donald Trump vs. Joe Biden

Conservative vs. Liberal

Urban vs. Rural

Climate Protectionists vs. Climate Change Deniers

Young vs. old

But are those socio-economic gaps also visible when it comes to Americans’ love of big cars?

Research question

Do car characteristics have any predictive power for the outcome of the US presidential vote?

Data gathering (1/2)

Two data samples were used

  1. Used Car dataset (Kaggle)

    • 3 million cars listed on CarGurus as of Sept. 2020 in 1338/3006 counties
    • Each car reported with 66 characteristics
    • resulting in a total of ~200 million data points

    Total file size of ~9.3GB

  2. Two data sets for the voting outcome, one at precinct level and one at state level [MIT Election Lab]

    • Voting outcome of 1427/3006 counties in 30/50 states
    • Split of votes for Presidential candidates per jurisdiction

    Total size for both files ~0.2GB

Data merging and cleaning (2/2)

Merging:

Problem: voting data is at county level, car data at latitude/longitude level

Solution: package ‘jvamisc’ maps each car's latitude and longitude to a county
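To illustrate the idea behind such a mapping, here is a minimal point-in-polygon sketch in Python. It is not the ‘jvamisc’ implementation: the county outlines, the names `COUNTIES`, `point_in_polygon` and `county_of`, and the coordinates are all made up for illustration, and real county boundaries would come from a boundary file.

```python
# Hypothetical county outlines as (longitude, latitude) vertex lists.
# Real boundaries would be loaded from a shapefile; these squares are invented.
COUNTIES = {
    "County A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "County B": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}


def point_in_polygon(lon, lat, polygon):
    """Ray casting: a point is inside if an eastward ray crosses an odd number of edges."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def county_of(lon, lat, counties=COUNTIES):
    """Return the first county whose outline contains the point, or None."""
    for name, polygon in counties.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    return None


print(county_of(1.0, 1.0))
print(county_of(3.0, 0.5))
print(county_of(9.0, 9.0))
```

Each car listing then carries a county label that can be joined to the county-level voting data.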

Cleaning approach

  1. String splitting and variable type definition
  2. Visually identified outliers were excluded using an out-of-memory approach (‘ff’); only observations meeting all of the following conditions were kept:
    • city fuel economy < 70 miles per gallon

    • highway fuel economy < 60 miles per gallon

    • Horsepower < 600

    • Price < $200,000

    • Mileage < 300,000 miles

    • rpm (revolutions per minute) < 2000

    • Savings Amount < 2500

    • year > 1900
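The thresholds above can be applied row by row, so the full 9.3 GB file never has to sit in memory. The sketch below is a Python analogue of that idea, not the ‘ff’ workflow itself. The column names, the helper names `keep` and `filter_stream`, and the sample rows are assumptions; dropping rows with missing values is also a choice made here, not one stated on the slide.

```python
import csv
import io

# Upper bounds taken from the slide; the column names are assumed.
UPPER_BOUNDS = {
    "city_fuel_economy": 70,
    "highway_fuel_economy": 60,
    "horsepower": 600,
    "price": 200_000,
    "mileage": 300_000,
    "rpm": 2000,
    "savings_amount": 2500,
}
MIN_YEAR = 1900


def keep(row):
    """True if the row passes every threshold; rows with missing or unparsable values are dropped."""
    try:
        if any(float(row[col]) >= bound for col, bound in UPPER_BOUNDS.items()):
            return False
        return float(row["year"]) > MIN_YEAR
    except (KeyError, TypeError, ValueError):
        return False


def filter_stream(lines):
    """Yield passing rows one at a time instead of loading the whole file."""
    for row in csv.DictReader(lines):
        if keep(row):
            yield row


# Made-up sample: the second row fails on fuel economy, the third on horsepower.
SAMPLE = """city_fuel_economy,highway_fuel_economy,horsepower,price,mileage,rpm,savings_amount,year
25,33,180,21000,42000,1800,500,2017
127,110,150,30000,10000,1500,0,2019
18,24,710,65000,5000,1900,0,2020
"""

kept = list(filter_stream(io.StringIO(SAMPLE)))
print(len(kept))
```

For the real file, `io.StringIO(SAMPLE)` would be replaced by an open file handle, which `csv.DictReader` also reads line by line.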

The sample in use (1/2)

Dependent variable:

  • Democratic share of the two-party (Democratic + Republican) vote

\(\frac{\text{Democratic votes}}{\text{Democratic votes} + \text{Republican votes}}\)
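As a worked example of this ratio, a small Python sketch follows. The function name `democratic_share` and the vote counts are made up, and returning `None` for a jurisdiction with no two-party votes is an assumption.

```python
def democratic_share(dem_votes, rep_votes):
    """Democratic votes divided by Democratic plus Republican votes."""
    total = dem_votes + rep_votes
    if total == 0:
        return None  # assumed handling: no two-party votes recorded
    return dem_votes / total


# Hypothetical county: 6,000 Democratic and 4,000 Republican votes
print(democratic_share(6000, 4000))
```

A value above 0.5 means the Democratic candidate won the two-party vote in that jurisdiction; votes for other candidates do not enter the ratio.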

Independent variables:

  • Is new (if car is new or pre-owned)

  • Price

  • Fuel economy city (fuel consumption in the city)

  • Mileage

  • Horsepower

  • Length

  • Max seating

  • Body type

  • Brand name

  • State

→ Total sample size: 2.6 million observations

Sample in Use (2/2)

Analysis Approaches

Linear Regression

  • on CPU
  • on GPU
    • package ‘GPUtools’ was used to ….
    • CUDA for NVIDIA GPUs (drawback: does not work on other GPUs)
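For reference, a linear regression can be solved through the normal equations X'X b = X'y. The matrix products in these equations are the kind of work a GPU can take over. The pure-Python sketch below shows only the CPU arithmetic on made-up data; the names `solve` and `ols` are illustrative and it makes no claim about how the packages above implement it.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data generated exactly from y = 1 + 2*x1 - 1*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * a - b for a, b in rows]
coef = ols(rows, y)
print(coef)
```

With 2.6 million observations the cost is dominated by forming X'X and X'y, which is why offloading those products is attractive.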

XGBoost: gradient-boosted trees, where each new tree is fitted to the errors of the trees before it and the prediction is the sum over all trees

  • in RAM
  • Out-of-memory
    • Parallelization
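The boosting idea can be sketched in a few lines of Python: start from the mean, then repeatedly fit a small tree to the current residuals and add a shrunken copy of it to the model. This is plain gradient boosting with squared error on one feature, not XGBoost itself, which adds regularization, second-order information and much more. The names `fit_stump` and `boost`, the hyperparameter values, and the data are made up.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on one feature (squared-error criterion).

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda xi: lmean if xi <= threshold else rmean


def boost(x, y, n_trees=50, learning_rate=0.3):
    """Fit stumps sequentially, each one to the residuals left by the previous ones."""
    base = sum(y) / len(y)
    trees = []
    pred = [base] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        pred = [pi + learning_rate * tree(xi) for xi, pi in zip(x, pred)]
    return lambda xi: base + learning_rate * sum(t(xi) for t in trees)


# Made-up data with a jump between x = 3 and x = 4
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
model = boost(x, y)

mean_y = sum(y) / len(y)
mse_base = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_model = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(mse_base, mse_model)
```

The number of trees and the learning rate trade off against each other, which is why the number of trees is usually tuned on held-out data.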

Linear regression Coefficients and Robustness

  • Tested for heteroskedasticity and multicollinearity
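The slides do not say which diagnostics were used. One common check for multicollinearity is the variance inflation factor (VIF), sketched here in Python for the two-predictor case, where VIF = 1 / (1 - r²) with r the correlation between the two predictors. The helper names `pearson_r` and `vif_two_predictors` and all data are made up.

```python
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    var_a = sum((ai - ma) ** 2 for ai in a)
    var_b = sum((bi - mb) ** 2 for bi in b)
    return cov / sqrt(var_a * var_b)


def vif_two_predictors(a, b):
    """VIF of either predictor when the model has exactly two predictors."""
    r = pearson_r(a, b)
    return 1.0 / (1.0 - r ** 2)


# Made-up predictors: x2 is almost a multiple of x1, x3 is uncorrelated with x1
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 11]
x3 = [2, 1, 3, 1, 2]

print(vif_two_predictors(x1, x2))
print(vif_two_predictors(x1, x3))
```

A frequently quoted rule of thumb treats a VIF above about 10 as a sign of problematic collinearity; with more than two predictors, each VIF comes from regressing one predictor on all the others.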

XGBoost parameter importance plot

Train Test

Show table here with R^2 etc. of train and test samples for both methodologies

Results with XGBoost prediction

Show map with final prediction outcome by XGBoost estimator

Sources of Data

XGBoost: Concept

Results with linear regression

Show map with final prediction outcome by linear estimator

Actual vs. predicted values (Linear)

Actual vs. predicted values (XGBoost)

XGBoost number of trees optimization